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The PITCHf/x database has allowed the statistical analysis of of Major 
League Baseball (MLB) to flourish since its introduction in late 2006. Using 
PITCHf/x, pitches have been classified by hand, requiring considerable effort, 
or using neural network clustering and classification, which is often difficult to 
interpret. To address these issues, we use model-based clustering with a mul- 
tivariate Gaussian mixture model and an appropriate adjustment factor as an 
alternative to current methods. Furthermore, we describe a new pitch classifi- 
cation algorithm based on our clustering approach to address the problems of 
pitch misclassification. We illustrate our methods for various pitchers from the 
PITCHf/x database that covers a wide variety of pitch types. 
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1 Introduction 



Perhaps more than any other technological contribution in baseball, the deployment 
of the PITCHf/x system has proven to be an invaluable resource to teams and to 
j> fans of the game in their statistical analyses of baseball, both in its original form 

and as augmented by estimates of pitch spin parameters (known as "psuedo-spin"). 
End users of the system, particularly brooksbaseball.net and MLB Advanced Media 
(MLB-AM), have classified these pitches into common pitch types, both by hand and 
using neural network classification methods. We hypothesize that the current method 
for classifying pitches in the database can be refined due to evidence from graphical 
exploration and previous research [1]. 

We improve the current pitch classification system by comparing the current neu- 
ral network classification results to these from various statistical clustering methods, 
such as k-means, hierarchical clustering, and model-based clustering with a multi- 
variate Gaussian mixture model (MBC). We ultimately propose an alternative basis 
for pitch clustering and classification. In Section 2 we describe the current meth- 
ods used and the PITCHf/x database. In Section 3, we introduce the model-based 



clustering method, which has several advantages for examining pitch types includ- 
ing pitches with high variance, pitch evolution across time, and pitches with similar 
characteristics. We also address implementation of the MBC method as well as the 
selection of the clusters and the stability of this clustering method. In Section 4, we 
propose a novel algorithm for classifying pitches based on their characteristics, before 
concluding with future work in Section 5. 

2 Data and Current Methods 

Publicly available PITCHf/x data is available from several sourcesF] Our data subset 
consists of pitches thrown by roughly 900 pitchers in the 2010 and 2011 seasons; we 
exclude data from before the 2010 season due to reported inconsistencies within the 
PITCHf/x system. One important point to make is that ground truth is not known, 
which is why we have measured stability of our method in Section 3.2, in terms of 
cluster memberships. 

The raw database contains trajectory information on each pitch, including ac- 
celeration and velocity, though not the spin of each pitch, which is useful in pitch 
classification because it can help distinguish between different pitch types. Some spin 
variables can be estimated using physics and a number of simplifying assumptions 
[3]; the name "psuedo-spin" is given to these quantities due to this. 

There are several important variable definitions that we use throughout our paper 
and in our figures: 

• The pitch's start speed measured in miles per hour at the release point. This 
is measured using radar guns and is commonly acknowledged as the speed of 
the pitch. 

• The back spin of the baseball measured in radians per second. A positive 
number represents back spin. Most fastballs have back spin, while off-speed 
pitches have a tendency to have more "top-spin" , or negative back spin. 

• The side spin of the baseball measured in radians per second. A positive num- 
ber represents left-to-right spin, or a left-handed pitcher's curveball or slider. A 
negative number represents right-to-left spin, or the direction of a right-handed 
pitcher's curveball or slider. 

Figure la plots these three variables for pitches thrown by Barry Zito; these pitches 
have been classified with the system developed via a neural network classification sys- 
tem for each pitcher developed by MLB Advanced Media. Specifics pertaining to the 



Our data was downloaded from http://www.wcintlinux.net/2009/10/ 



pitch-f x-data-with-pitch-type/. as well as data that we manually scrape from MLB.com with 
the aid of an edited Perl script and directions from the article How to Build a Pitch Database [2] 



method, model, and training data used are not publicly available [1]. 

We have reason to believe Barry Zito threw 5 types of pitches in the 2010 and 2011 
seasons due to the obvious 5 different clusters observed graphically, and according to 
Brooks Baseball, whose PITCHf/x classifications are considered by some the most 
accurate available to the public, those pitches are a four-seam fastball, sinker, slider, 
curveball, and changeup; but they do not make their method or data available, only 
their results [3]. Moreover, we speculate that the pitches in the two-seam fastball 
cluster should instead be classified as sinkers since Brooks Baseball classifies these 
pitches as such. The neural net classifies Brooks Baseball's sinker cluster as two- 
seam fastballs. We investigate whether or not Barry Zito's two-seam fastball should 
be labeled as a sinker in Section 4 by implementation of a new classification algo- 
rithm. We ultimately agree with Brooks Baseball and label the cluster as a sinker. 
Overall, our model tends to split up four-seam fastballs and two-seam/sinker clusters 
differently than the neural networks method in a way that more closely resembles 
Brooks Baseball classification and empirically makes more sense. We explore this 
further in Section 4 when we discuss the Cliff Lee classification example. 

In addition, the neural net classification appears to have obvious misclassifica- 
tions. Four-seam fastballs are classified as curveballs, and curveballs/changeups are 
classified as sliders, which are clearly labeled in the incorrect clusters in Figure la. 



3 Model-Based Clustering with Gaussian Mixture 
Models 

Mixture models for clustering rely on a straightforward generative premise: there is a 
series of simple probabilistic models for how an event can be generated, and a weight 
on which model will be used to generate the observation. This description lines up 
directly with pitcher intent: while we as observers may not know what type of pitch 
is intended, the pitcher himself makes a choice of a specific pitch type (fastball, slider, 
curveball, etc) with a basic profile: a grip and arm motion that gives the ball a desired 
speed, spin and trajectory. 

A multivariate Gaussian model for any particular pitch profile makes intuitive 
sense. Each coordinate has a mean value - for example, a typical four-seam fastball 
might have an initial velocity of 95 miles an hour, a back spin of 100 radians per 
second and a side spin of 10 radians per second. The resulting pitch is then affected 
by many different sources of noise, both in the pitcher's delivery and in other external 
factors like the wind, and this noise can affect multiple pitch characteristics at once. 
The resulting pattern in three dimensions is an ellipsoid; in Figures lb, 2, 3, and 5b, 
we visualize the MBC results for a variety of pitchers. One particular advantage to 
this approach is that we can detect when two clusters overlap, since geometry, and 



not just proximity, is an important factor. 

Given a set of Gaussian clusters and weights, this routine determines the proba- 
bility that each observed pitch belongs to a given cluster as the relative probability 
density for the pitch if it were a member of each cluster, factoring in each cluster's rel- 
ative weight - most pitchers, for example, throw more fastballs than other off-speed 
pitches, and this is taken into account directly. In our classifications we declare a 
pitch to belong to the class with the highest probability under the model. 

We use the mclust library in the R statistical programming software to perform 
our clustering operations, with some modifications that we describe to account for ad- 
ditional information. Given a pre-selected number of clusters, we use an Expectation- 
Maximization algorithm to calculate the maximum likelihood estimates (MLEs) for 
each cluster location, shape, and weight. Once we run this across a range of cluster 
counts, we use the Bayesian Information Criterion (BIC) model selection criteria to 
determine the optimal number of clusters in the model, which attempts to maximize 
the likelihood of the data while penalizing excessive numbers of parameters. 

Empirically, MBC creates clusters of pitches that are more tightly confined than 
current pitch classifications, suggesting there are very few curveballs, changeups, or 
sliders misclassified. A major concern is the difference between the four-seam fastball 
and the sinker /two-seam fastball clusters, as seen in the Figures section. We expect 
two-seam fastballs/sinkers to have a slightly slower start speed than four-seam fast- 
balls because it is known four-seam fastballs to be a pitchers fastest pitch, but this 
is not the case with the PITCHf/x neural networks classification method. However, 
MBC weights the velocity as an important factor between the two clusters and splits 
them accordingly. 

Figure lb shows the MBC as it performs for the pitches of Barry Zito; the red 
cluster is the faster pitch and light blue cluster is the slightly slower pitch with ver- 
tical spin closer to Zito's other off-speed pitches. The MBC's potential four-seam 
fastball cluster (red) has the smallest number of pitches, but other sources suggest 
that Zito threw his four-seam fastball most often, leading us to suspect that while 
the labels (from Section 4) may need adjusting, the proper clusters are still being de- 
tected. This also may indicate there is not much difference between Zito's four-seam 
and two-seam/sinker pitches from the batter's perspective. To further support subtle 
advantages of MBC over the neural networks classification, we evaluate the overall 
stability of our clustering method in Section 3.2. 



3.1 Calculating the Number of Clusters 

In its original form, the model tends to choose more clusters than fewer for the pitch- 
ers we have tested. On inspecting the clusters produced, it is clear that the method 
is favoring relatively "thin" clusters, which have high internal correlations between 



variables, which is highly unrealistic for the physical examples we consider. Limiting 
the model to a smaller number of pitches is not generally feasible, and while manual 
inspection is possible, it would be far more preferable to automate the method to 
remove this issue. 

We develop our own criterion for choosing the number of clusters, called the ad- 
justed Bayesian Information Criterion, or BIC a dj. Since we observe that most pitch 
clusters are close to spherical, and our prior knowledge suggests that flat ellipsoidal 
clusters are unlikely, we are motivated to constrain the creation clusters which have 
high intra-cluster correlations between the three variables of interest. Currently, each 
cluster k has three parameter sets: //&, the cluster mean; o^, the standard deviation 
of each dimension; and £&, the intra-cluster correlation matrix. It is the terms of E& 
that need to be kept small in absolute value (1 or -1 indicates perfect correlation, 
indicates no correlation). 

To account for this, we develop an additional penalty term for the current BIC 
formula that adds a value proportional to each intra-cluster correlation term. Using 
BIC a dj, if the clusters the model finds with k—6 have high intra-cluster correlations, 
compared to k—5, than the correlation penalty term will be large, and BIC a dj will be 
smaller for k=5. We choose our k based off of the minimum BIC or BIC a dj. 

Figure lb displays the clustering for Barry Zito chosen with adjusted BIC (BICadj), 
which chooses five clusters, which agrees with the pitch number selection of both the 
neural network classification and Brooks Baseball. For this application, BIC a dj is a 
substantial improvement over BIC, and is the method we use going forward in choos- 
ing the number of pitch clusters. 

In order to fully grasp all of the benefits the MBC method can offer, we inves- 
tigate how it performs when clustering pitchers other than Barry Zito. In fact, we 
have investigated MBC's performance on the entire 2010 and 2011 season, and simply 
show a few illustrations from this large analysis performed. 

Figure 2 shows how our method is ideal for detecting how a pitch can evolve over 
time. The purple and black clusters represent Jon Lester's curveballs. The difference 
between the two clusters is the speed and horizontal break of the pitch. In 2010, Jon 
Lester's curveball averaged 78 MPH while in 2011 it averaged 76 MPH with slightly 
more horizontal break - a subtle but important difference. Figure 3 visualizes how our 
method can also cluster pitches with high variances such as Tim Wakefield's knuckle- 
ball (orange); the pseudo-spins, corresponding to additional break, are considerably 
wider than for other pitches due to the unpredictable nature of stitch position. 



3.2 Stability Of Cluster Memberships 

We evaluate quality of the MBC method by assessing the stability of the method in 
terms of how sensitive our model is to smaller sample sizes. We go about accomplish- 
ing this by taking the data for each pitcher and running the clustering on an 80% 
subset of the data, the remaining 20% subset, and then on the full data. Next, we 
calculate the number of pitches in both the 80% and 20% subsets that do not change 
clusters compared to the full dataset. In order to be confident in our results, we repeat 
this process 20 times for each pitcher and find the mean and standard error. We find 
that after 20 samples our standard error is sufficiently small and we are confident with 
our stability sample mean estimates. Figure 4 is the two distributions for the 80% and 
20% subsets and the proportion of pitches that are in the same cluster as the full data 
for all pitchers. Overall, for both the 80% and 20% subsets the majority of pitchers 
have 80% or more of their pitches clustered in the same cluster. We found that this 
stability holds on sample sizes as low as 100 pitches, or the average number of pitches 
a starting pitcher throws in one start. It is important to note that we kept the num- 
ber of clusters chosen (k) the same on both subsets as when we ran it on the full data. 



4 Classification 

In order to have a fully automated pitch clustering and classification system, we pro- 
pose a simple classification algorithm, which improves pitch clustering by assigning 
sensible pitch labels to the respective clusters from our heuristic. Although ground 
truth is unknown, we make comparisons with the current classification labels in the 
PITCHf/x database, Brooks Baseball classifications, and graphical visualization to 
determine how well our method works. We found that our classification algorithm 
has many strengths including consistent performance and labeling multiple clusters 
as the same pitch type when appropriate, such as the Jon Lester curveball evolution 
over time scenario referenced in Section 3.2 and visualized in Figure 2. Our cluster- 
ing algorithm appears to have similar classification results as Brooks Baseball (which 
is the closest to the "truth" - that is, a pitcher's actual choice of pitch to throw - 
against which we can compare). 

Our algorithm classifies and names a key subset of pitches: four-seam fastballs, 
two-seam fastballs, sinkers, change-ups, cut-fastballs, sliders, curveballs, and knuck- 
leballs. For each classification, the algorithm begins by assigning the cluster mean 
with the highest starting velocity as a four-seam fastball. For each additional cluster, 
the algorithm goes through a series of constraints that are derived from each cluster's 
mean and variance to determine the label for each cluster. For example, if a cluster 
mean has the same side spin direction as the four-seam fastball then it checks if the 
speed differs from the four-seam fastball by more than 6 MPH and if the side spin 
varies less than 60 rotations per second from the four-seam fastballs. If it does, it 
assigns the cluster as a change-up. If not, it determines if the side or back spin dif- 



ference from the four-seam fastball is greater. If the side spin difference is the greater 
of the two, it assigns the cluster as a two-seam fastball; if the back spin difference 
is larger, it assigns the cluster as a sinker. The rest of the classification algorithm 
follows similar decision processes. 

We evaluate how well our classification algorithm performs by taking a sample 
of various starting and relief pitchers with differing pitch repertoires. We found af- 
ter analyzing and comparing the results to Brooks Baseball and the neural network 
classification system, in 23 of the 25 scenarios our classification algorithm empiri- 
cally appear correct and has similar classification results as Brooks Baseball. The 
two pitchers that the algorithm does not classify correctly are Barry Zito and Derek 
Lowe. It interchanges their sinker and four-seam fastball clusters because in the data 
sample, the measured speed of the sinkers are bigger than those for the four-seam 
fastball, a characteristic that is not commonly true for other pitchers. 

In Figure 5, we visualize Cliff Lee's MLB-AM neural network classification com- 
pared to our method. Of particular note, our method separates and labels the two- 
seam and four-seam fastball by assigning the four-seam fastball to the fastest pitch 
with the least amount of back and top spin relative to Lee's other off-speed pitches. 
Our method splits the two fastballs similar to Brooks Baseball, but Brooks Baseball 
instead labels the two-seam fastball as a sinker. We also are able to observe that our 
method's slider (brown) cluster is clearly defined and classified with no other obvious 
misclassified pitches. The neural network classification has obvious misclassification 
in the slider cluster with pitches labeled as changups, curveballs, and cut-fastballs. 
Our method clearly improves both the clustering and classification of Cliff Lee com- 
pared to the neural network classification. 



5 Summary and Future Work 

We have proposed a new clustering method and classification algorithm and tested 
both approaches using the PITCHf/x database for 2010-2011. Our analysis illustrates 
better clustering than the current neural network method based upon our plots and 
measuring the stability of our method. Furthermore, based upon our model based 
clustering method, we recommend a simple algorithm to classify each pitch based 
on the individual pitcher, which performs extremely well in most situations. Its 
strongest features are correcting any obvious misclassification in the neural network 
model, accounting for pitch evolution over time, and performing well for pitches with 
comparatively high internal variance. Our method also performs well in a highly de- 
bated topic in MLB, namely distinguishing two-seam and four-seam fastballs. 

These improvements suggest others can be made in assigning pitchers themselves 
to clusters, a problem highly motivated by the need to assess a pitcher's likely per- 
formance against unfamiliar players or teams. In these situations, most hitters likely 



have never faced the current pitcher, and thus, there is no or very little data to leading 
to inference about the hitter's history against the current pitcher. Placing pitchers 
into similar groups and looking for relationships with groups of batters would be in- 
valuable to MLB. This information can lead to advanced and more detailed scouting 
reports of what type of pitches are a hitter's strength or weakness. Moreover, since 
we have prior information about many pitchers, we can utilize advanced Bayesian 
clustering methods and optimize the choice of k for better and more stable results. 
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6 Appendix: Figures and Tables 



Table 1: Named pitch types with corresponding colors. 


Pitch Name 


Four Seam 


Two Scam 


Sinker 


Cut Fastball 


Changcup 


Color 


Rod 


Grey 


Light Blue 


Blue 


Green 


Pitch Name 


Curveball 


Knuckleball 


Slider 


Intentional Ball 




Color 


Black and Purple 


Orange 


Brown 


Yellow 





Barry Zito: Neural Network Classification 



MBC Barry Zito: Adjusted BIC 
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Figure 1: The pitches thrown by Barry Zito. Figure la (left) displays the MLB 
Advanced Media classification developed via a neural network system. There are 
obvious misclassifications: a small number of four-seam fastballs (which should be 
red) are classified as sliders (brown), as are some curveballs (black) and changeups 
(green). Figure lb (right) displays the MBC model and tightly clusters the pitches, 
as well as splits up the four-seam and sinker (light blue) clusters in an empirically 
sensible way. 
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Figure 2: The pitches thrown by Jon Lester classified using model-based clusters, 
which shows two distinct clusters for curveballs (purple and black) corresponding to a 
change over time. The difference between the two clusters is the speed and horizontal 
break of the pitch. This is one example of how the MBC model can detect subtle but 
important pitch evolution differences. 




Figure 3: The pitches thrown by Tim Wakefield classified using model-based clus- 
ters. The MBC model can also detect and classify pitches well with relatively higher 
variances like the pseudo-spins for Tim Wakefield's (spinless) knuckleball (orange), 
compared to the fastball (red) and curveball (black). 
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Stability for 80% Subset (Number of Pitchers = 473) 



Stability for 20% Subset (Number of Pitchers = 473) 
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Figure 4: 80% and 20% Stability: Percent of pitches in the same cluster in subset 
and full dataset, across 20 replications of the procedure. For both the 80% and 20% 
subsets the majority of pitchers have 80% or more of their pitches clustered in the 
same cluster. We also found that this stability holds on sample sizes as low as 100 
pitches, or the average number of pitches a starting pitchers throws in one start. 



Cliff Lee: Neural Network Classification 



MBC Cliff Lee: adjusted BIC 
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Figure 5: Cliff Lee: Figure 5a (left) displays MLB's classification system developed via 
a neural network classification system. As before, there are obvious misclassifications, 
but our method's clusters are clearly defined and classified with no other obviously 
misclassified pitches. The MBC model in Figure 5b (right) separates and labels the 
two-seam (grey) and four-seam (red) fastballs by assigning the four-seam fastball to 
the fastest pitch with the least amount of back and top spin relative to Lee's other 
off-speed pitches. This method agrees with the manually corrected data from Brooks 
Baseball. 
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